InsightSwarm: A Multi-Agent Adversarial Framework for Automated Fact-Checking with Real-Time Source Verification, Human-in-the-Loop Oversight, and Adaptive Confidence Calibration
The rapid spread of misinformation on social and digital media demands automated fact-checking systems that are accurate, calibrated, and transparent. Existing approaches, namely single large-language-model (LLM) classifiers and rule-based systems, suffer from source hallucination rates of 15 to 30 percent and provide no visibility into their reasoning process. We present InsightSwarm, a production-grade multi-agent fact-checking system built on five concrete contributions: (1) adversarial debate between a role-locked ProAgent and a role-locked ConAgent, each backed by real-time web source retrieval; (2) a multi-layer FactChecker pipeline that independently fetches and validates every cited URL, reducing source hallucination to below 3 percent; (3) Human-in-the-Loop (HITL) intervention via LangGraph interrupt semantics, enabling mid-pipeline human source correction through a live React panel; (4) adaptive confidence calibration using geometric-mean source trust scoring to correct systematic underconfidence; and (5) claim complexity estimation that dynamically adjusts debate depth and resource allocation. Evaluated on a 100-claim FEVER-derived benchmark, InsightSwarm achieves an F1 score of 0.81 versus 0.68 for a zero-shot LLM baseline and 0.56 for a keyword baseline. The full system is open-source and available at https://github.com/AyushDevadiga1/Insight-Swarm.
Introduction
Misinformation spreads quickly online, while manual fact-checking is slow and single-LLM systems suffer from hallucinations and false citations. InsightSwarm addresses these issues by combining adversarial multi-agent reasoning (ProAgent versus ConAgent) with a FactChecker that validates every cited URL in real time, so that verdicts are grounded in actual web evidence. It is built as a low-cost, fully reproducible system using free-tier APIs.
The system architecture includes:
A FastAPI backend and a React frontend
A LangGraph-based multi-agent debate pipeline
A FactChecker that detects both missing (Type I) and misleading (Type II) hallucinations
A Moderator that produces final verdicts using trust-weighted scoring
A semantic cache and API failover system
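To make the FactChecker's two hallucination types concrete, the following is a minimal sketch, not the project's implementation. It assumes Type I means the cited URL cannot be fetched and Type II means the page exists but does not support the claim, following the order in which the list above names them. The names `Citation`, `check_citation`, `fetch`, and `min_overlap` are hypothetical, and the term-overlap test is a crude stand-in for whatever semantic check the real pipeline uses.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Set


@dataclass
class Citation:
    url: str
    claimed_support: str  # the statement an agent says this page backs


def _terms(text: str) -> Set[str]:
    # Lowercased alphanumeric tokens longer than three characters.
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def check_citation(
    citation: Citation,
    fetch: Callable[[str], Optional[str]],
    min_overlap: float = 0.5,  # hypothetical threshold, not from the paper
) -> str:
    """Classify a citation as 'type_i', 'type_ii', or 'verified'."""
    page_text = fetch(citation.url)
    if page_text is None:
        return "type_i"  # assumed meaning: the source is missing
    wanted = _terms(citation.claimed_support)
    if not wanted:
        return "type_ii"  # nothing checkable was claimed
    overlap = len(wanted & _terms(page_text)) / len(wanted)
    # Assumed meaning of Type II: a real page that does not back the claim.
    return "verified" if overlap >= min_overlap else "type_ii"


# In-memory stand-in for a web fetcher so the sketch runs offline.
pages = {"https://example.org/moon": "The Moon orbits the Earth roughly every 27 days."}
fetch = pages.get  # returns None for unknown URLs

print(check_citation(Citation("https://example.org/missing", "Moon orbits Earth"), fetch))
print(check_citation(Citation("https://example.org/moon", "Moon orbits Earth"), fetch))
print(check_citation(Citation("https://example.org/moon", "Vaccines cause autism"), fetch))
```

Passing the fetcher in as a parameter keeps the classification logic testable without network access; a production fetcher would also need timeouts and HTTP status handling.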
Key innovations include:
Real-time per-URL verification (including detecting fake support from real pages)
Multi-agent adversarial debate grounded in live web evidence
Human-in-the-loop correction during processing
Adaptive confidence calibration for more reliable judgments
Claim complexity estimation to optimize computational effort
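The geometric-mean trust scoring behind the confidence calibration can be illustrated with a short sketch. Only the use of a geometric mean over source trust scores comes from the text; the function names, the `max_boost` parameter, and the way the trust score lifts the raw confidence are assumptions made for illustration.

```python
import math
from typing import Sequence


def geometric_mean_trust(trust_scores: Sequence[float]) -> float:
    """Geometric mean of per-source trust scores, each in [0, 1]."""
    if not trust_scores:
        return 0.0
    if any(s < 0.0 or s > 1.0 for s in trust_scores):
        raise ValueError("trust scores must lie in [0, 1]")
    if any(s == 0.0 for s in trust_scores):
        return 0.0  # a single untrusted source zeroes the product
    return math.exp(sum(math.log(s) for s in trust_scores) / len(trust_scores))


def calibrate_confidence(
    raw_confidence: float,
    trust_scores: Sequence[float],
    max_boost: float = 0.5,  # hypothetical cap on the upward correction
) -> float:
    """Hypothetical rule: move raw confidence toward 1.0 in proportion to trust."""
    trust = geometric_mean_trust(trust_scores)
    return raw_confidence + (1.0 - raw_confidence) * trust * max_boost


mixed = [0.9, 0.1]
print(geometric_mean_trust(mixed))        # about 0.3
print(sum(mixed) / len(mixed))            # arithmetic mean: 0.5
print(calibrate_confidence(0.6, [0.9, 0.9]))
```

Compared with an arithmetic mean, the geometric mean is pulled down sharply by one weak source, so a verdict resting on a mix of strong and dubious evidence receives a smaller upward correction.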
The system was developed over 25 days, growing from a 400-line prototype into a 15,600-line production system with extensive testing and a modular architecture.
In evaluation on a FEVER-based benchmark (100 claims), InsightSwarm outperforms baselines:
F1 score: 0.81, versus 0.68 for the zero-shot LLM baseline and 0.56 for the keyword baseline
Source hallucination rate below 3 percent, against a 20 percent baseline
Better calibration and balanced precision and recall
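As a reminder of why balanced precision and recall matter for the F1 figures above, here is a small sketch of the standard binary F1 computation. The confusion counts are invented for illustration and are not the paper's results; the benchmark's actual label scheme is not specified here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Binary F1: the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: balanced precision and recall (0.8 and 0.8).
print(round(f1_score(tp=40, fp=10, fn=10), 2))

# Hypothetical counts: precision 0.6, recall 0.9. The arithmetic mean
# would be 0.75, but the harmonic mean is lower.
print(round(f1_score(tp=45, fp=30, fn=5), 2))
```

Because F1 is a harmonic mean, a system cannot reach a high score by maximizing recall at the expense of precision, or the reverse.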
Conclusion
InsightSwarm demonstrates that multi-agent adversarial fact-checking with per-URL source verification, human-in-the-loop oversight, adaptive confidence calibration, and complexity-driven resource allocation achieves F1 = 0.81, a 19 percent relative improvement over a strong single-LLM baseline (F1 = 0.68), at zero infrastructure cost. The hallucination rate below 3 percent, against a 20 percent baseline, validates the structural verification approach over prompt-level heuristics.
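The 19 percent figure is a relative gain, not an absolute difference in F1. The arithmetic can be checked directly from the reported scores; the helper name `relative_improvement` is introduced here only for illustration.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# Reported F1 scores: InsightSwarm 0.81, zero-shot LLM 0.68, keyword 0.56.
print(f"{relative_improvement(0.81, 0.68):.1%}")  # 19.1%
print(f"{relative_improvement(0.81, 0.56):.1%}")  # 44.6%
```

The absolute gap to the zero-shot LLM baseline is 0.13 F1 points.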
The 25-day development trajectory from a 400-line prototype to a 15,600-line production system illustrates that principled software engineering — test-driven development, modular architecture, iterative hardening — is as consequential as algorithmic novelty in building trustworthy AI systems.
Three directions are planned for future work. First, FAISS-indexed vector retrieval will replace the current linear cache scan, enabling scalable deployment with tens of thousands of cached claims. Second, Celery-based asynchronous task brokering will support multi-user concurrency beyond the current FastAPI synchronous ceiling. Third, an LLM-based fallacy classification head trained on labeled debate transcripts will replace the current regex heuristics in ArgumentationAnalyzer, enabling detection of subtler argumentation failures. Beyond these three, multilingual support (Hindi, Marathi, Tamil, and Bengali) is prioritized for India’s non-English-speaking population, where misinformation spreads at the highest rates.
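To show why the linear cache scan limits scale, the sketch below compares a query embedding against every cached entry using cosine similarity. It is an assumption about how such a semantic cache could work, not the project's code: the names `cosine_similarity` and `lookup_linear`, the 0.9 threshold, and the toy two-dimensional embeddings are all hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup_linear(
    query: Sequence[float],
    cache: List[Tuple[Sequence[float], str]],
    threshold: float = 0.9,  # hypothetical similarity cutoff
) -> Optional[str]:
    """Return the cached verdict of the most similar claim, or None on a miss."""
    best_score, best_verdict = threshold, None
    for embedding, verdict in cache:  # one comparison per cached claim: O(n)
        score = cosine_similarity(query, embedding)
        if score >= best_score:
            best_score, best_verdict = score, verdict
    return best_verdict


# Toy cache of (embedding, verdict) pairs.
cache = [([1.0, 0.0], "TRUE"), ([0.0, 1.0], "FALSE")]
print(lookup_linear([0.99, 0.05], cache))  # close to the first entry
print(lookup_linear([1.0, 1.0], cache))    # no entry above the threshold
```

Each lookup touches every cached embedding, so cost grows linearly with cache size; a vector index is intended to avoid that full pass.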
References
[1] NASSCOM, "Internet in India Report 2023," Internet and Mobile Association of India (IAMAI) and Kantar, New Delhi, India, 2023. [Online]. Available: https://www.iamai.in/research/reports
[2] H. Farid, "Detecting Deepfakes," IEEE Signal Processing Magazine, vol. 39, no. 1, pp. 14-23, 2022.
[3] J. Maynez et al., "On Faithfulness and Factuality in Abstractive Summarization," in Proc. ACL, 2020.
[4] N. Hassan et al., "ClaimBuster: The First-ever End-to-end Fact-Checking System," Proc. VLDB Endow., vol. 10, no. 12, 2017.
[5] I. Augenstein et al., "MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims," in Proc. EMNLP, 2019.
[6] J. Thorne et al., "FEVER: A Large-scale Dataset for Fact Extraction and VERification," in Proc. NAACL, 2018.
[7] S. Min et al., "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation," in Proc. EMNLP, 2023.
[8] Y. Du et al., "Improving Factuality and Reasoning in Language Models through Multiagent Debate," in Proc. ICML, 2023.
[9] L. Zhang et al., "Multi-Agent Systems for Misinformation Detection: A Survey," arXiv preprint, 2023.
[10] C. Han, W. Zheng, and X. Tang, "Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models," arXiv preprint, 2025.
[11] S. Kadavath et al., "Language Models (Mostly) Know What They Know," arXiv:2207.05221, 2022.
[12] LangChain, "LangGraph Documentation," 2024. [Online]. Available: https://python.langchain.com/docs/langgraph
[13] S. Patel, D. Gupta, and A. Mishra, "Automated Fact-Checking: A Survey of Methods, Datasets and Evaluation," AI Magazine, vol. 45, no. 2, pp. 89-113, 2024.